Single-cell RNA-seq data analysis

Introduction

Single-cell RNA-seq data analysis usually follows the steps shown in the figure below. The first step is the generation of the count matrices using the raw sequencing data. After that we start the quality control steps that can include removal of doublets, ambient RNA correction, and removal of low quality cells (i.e. high mitochondrial content (dying cells) and low number of reads (empty droplets)). Once we have a filtered and high quality dataset, we proceed to the pre-processing steps that include normalization, feature selection, dimensionality reduction, and clustering of the cells (explained with more details later).

Overview of analysis steps for scRNA-seq. Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet 24, 550–572 (2023).

For the sake of simplicity and time, we will skip the quality control steps that precede any analysis done with scRNA-seq data. If you want to know more about the quality control steps you can check here and here. Therefore, here we will use a dataset that has already been quality controlled and filtered.

Initial set-up

Set working directory

In this step, you set your working directory, that is, the folder in your computer where you are going to save your scripts, data and results.

path <- '/Users/nvribeiro/Library/CloudStorage/OneDrive-UMCG/Desktop/PhD/Others/BMS-BigData-Course-2025' # you need to change this to your own path

setwd(path)

Load libraries

If you haven’t done, check the file “Preparation for data analysis – Big Data Course BMS 2025” and follow the instructions to install/update R, RStudio and the packages that you will used in the analysis. Then load the libraries.

library(Seurat)
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)
library(stringr)
library(MAST)
library(ggrepel)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
library(knitr)
library(presto)
library(forcats)
# Optional: define a default theme to be used in all plots
default_theme <- function(){
  theme_bw() +
    theme(panel.grid = element_blank(),
          panel.border = element_rect(linewidth = 1),
          axis.title = element_text(size = rel(1)),
          axis.title.y.left = element_text(margin = margin(r = 10, unit = 'pt')),
          axis.title.y.right = element_text(margin = margin(l = 10, unit = 'pt')),
          axis.title.x.bottom = element_text(margin = margin(t = 10, unit = 'pt')),
          plot.title = element_text(size = rel(1), hjust = 0.5, face = 'bold'),
          axis.text = element_text(size = rel(1)),
          axis.ticks.length = unit(5, 'pt'),
          strip.background = element_rect(fill = "#f5a9a9", linewidth = 0),
          strip.text = element_text(size = rel(1), color = "black")
    )
}

Loading the tutorial data

We will use a small subset of 10,000 single-cells from small intestine biopsies of pediatric samples from the Gut Cell Atlas by Elmentaite at al. (2021). This dataset is included in the files you downloaded previously and is called “Elmentaite_2021_P_smallInt_BMS_tutorial.rds”.

obj <- readRDS("data/Elmentaite_2021_P_smallInt_BMS_tutorial.rds")
obj
An object of class Seurat 
33538 features across 10000 samples within 1 assay 
Active assay: RNA (33538 features, 0 variable features)
 2 layers present: counts, data

The code above loads the dataset and tells us that this is a Seurat object with 10,000 cells and 33,538 features (genes), and we are working with a RNA assay. Now we can start pre-processing this data and run some analysis.

Pre-processing

Single-cell RNA-sequencing is prone to many different sources of variance, for example cells can have very different numbers of gene counts because of differences in the amount of mRNA present in them or purely randomly during the sequencing. Normalization of counts make the transcriptome profile of all cells comparable among them. During this process, you also want to remove unwanted sources of variation, for example genes involved in cell cycle or mitochondrial genes. If you want to know more about normalization, you can read this.

Moreover, the counts matrix is huge and sparse (contains a lot of zeros), which means that not every detected gene is biologically informative. The features selection step is useful to reduce the computational power required and make sure the analysis focus only on informative genes. These genes ideally explain the biological variation in the data by prioritizing genes the vary between subpopulations, instead of within a subpopulation. These genes are then used to perform dimensionality reduction, which in this case is a principal component analysis (PCA). Using the PCA dimensions, we can then group cells into clusters based on how similar they are in this multidimensional space and visualize them in a UMAP representation.

The code below takes care of perform all the steps above. The resolution parameter in the function FindClusters() below is the one you need to pay more attention and adjust according to your dataset. A large value will lead to more clusters, and a small one, to fewer. The best way to guess the resolution parameter is to have some idea of how many cell types you expect in your sample based on the tissue you are working it, and then try different values until you find one that gives you a reasonable number of clusters. In this case, we are working with small intestine, a tissue with many different cell types, so we expect to have many different clusters. In the original publication, they have hundreds of different cell types but because we are working only with a small subset of the data and a low resolution to speed thing up a bit, so we will have much less.

options(future.globals.maxSize = 8000 * 1024^2)

obj <- obj %>% 
  NormalizeData() %>%
  FindVariableFeatures() %>%
  ScaleData(vars.to.regress = "pct_counts_mt") %>% # Regresses out the percentage of mitochondrial RNA
  RunPCA() %>%
  FindNeighbors(dims = 1:50) %>%
  FindClusters(resolution = 0.3) %>%
  RunUMAP(dims = 1:50)
Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.
Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.
Regressing out pct_counts_mt
Centering and scaling data matrix
PC_ 1 
Positive:  IGFBP7, CALD1, IFITM3, SPARC, COL1A2, COL3A1, A2M, C1S, RARRES2, COL6A2 
       SPARCL1, COL1A1, COL6A1, C1R, DCN, CXCL14, MFAP4, LUM, LGALS1, MMP2 
       TIMP1, TM4SF1, COL4A2, COX7A1, NUPR1, IGFBP4, TPM2, ADAMDEC1, MYL9, GPX3 
Negative:  IGHM, TCL1A, HLA-DQA1, DUSP2, KLRB1, TRBC1, CCL5, HMGA1, LRMP, PIM2 
       HLA-DPB1, MZB1, IGKC, GPR183, HLA-DPA1, SRGN, LINC01857, RGS13, NKG7, LINC01871 
       FAM30A, SPIB, ITGB2, IL7R, CST7, IGLC3, IGLC2, DERL3, TNFRSF17, TRGC2 
PC_ 2 
Positive:  ALDOB, FABP2, ANPEP, PRAP1, GUCA2A, DPEP1, FABP1, PHGR1, AMN, CDHR5 
       SMIM24, APOB, C19orf33, MEP1A, RBP2, SI, EPCAM, CES2, MTTP, PCK1 
       ACE, C3orf85, FABP6, AOC1, ENPEP, MGAM, KRT19, SERPINA1, CDHR2, MYO1A 
Negative:  VIM, CALD1, LGALS1, IGFBP7, COL6A2, RARRES2, COL1A2, COL3A1, C1S, SPARC 
       COL6A1, A2M, COL1A1, TIMP1, DCN, C1R, MMP2, CXCL14, TPM2, NUPR1 
       SPARCL1, LUM, IFITM3, MFAP4, TCF4, SELENOM, C11orf96, IGFBP5, SOD3, SERPING1 
PC_ 3 
Positive:  PLVAP, ADGRL4, PCAT19, VWF, RAMP2, RAMP3, ECSCR, CYYR1, EGFL7, CLDN5 
       CD34, JAM2, CDH5, PECAM1, FLT1, MMRN2, ESAM, CD93, PCDH17, PTPRB 
       TM4SF18, APLNR, PODXL, CALCRL, CD320, SOX18, ROBO4, FAM110D, MYCT1, TMEM88 
Negative:  DCN, COL3A1, COL1A2, LUM, MFAP4, CXCL14, C1S, ADAMDEC1, FBLN1, RARRES2 
       TCF21, COL1A1, CFD, EMILIN1, COL6A3, CCL11, MEG3, COL6A2, MMP2, CTSK 
       APOE, PDGFRA, C1R, CXCL1, CYGB, COL6A1, SPON2, HAPLN1, ABCA8, EDIL3 
PC_ 4 
Positive:  AGR2, KRT18, OLFM4, GPX2, PIGR, PPP1R1B, SLC12A2, CEACAM5, TFF3, REG4 
       LGALS4, FCGBP, CLCA1, SPINK1, MUC2, DMBT1, ST6GALNAC1, SPINK4, STARD10, CDC42EP5 
       REG1A, KRT8, CREB3L1, ITLN1, BCAS1, MGST1, ADH1C, HEPACAM2, FOXA3, SPDEF 
Negative:  APOB, CYP3A4, SLC6A19, APOA4, ACE, SLC15A1, CUBN, APOC3, ALPI, APOA1 
       CREB3L3, GUCA2A, XPNPEP2, MGAM, MTTP, C3orf85, DPEP1, PRAP1, ANPEP, CDHR2 
       GUCA2B, SLC5A12, SLC26A3, SMIM24, SULT1A2, IL32, NAALADL1, CPO, MEP1B, MS4A10 
PC_ 5 
Positive:  PCLAF, MKI67, TOP2A, STMN1, UBE2C, HMGB2, BIRC5, NUSAP1, AURKB, RRM2 
       TYMS, PTTG1, CDK1, CDKN3, CENPF, TK1, CCNA2, GTSE1, TPX2, CDC20 
       TUBB, RGS13, CENPA, CKS2, MND1, HMMR, ASPM, MYBL2, H2AFZ, CDT1 
Negative:  CLCA1, MUC2, FCGBP, REG4, BCAS1, TFF3, SPINK4, ZG16, HEPACAM2, REP15 
       S100P, CEACAM5, ATOH1, MLPH, SPDEF, CREB3L1, TFF1, SCNN1A, ITLN1, FOXA3 
       ST6GALNAC1, LRRC26, CEACAM6, STARD10, ZG16B, FXYD3, TPSG1, NPDC1, TSPAN1, AC005833.1 
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10000
Number of edges: 433662

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9557
Number of communities: 21
Elapsed time: 0 seconds
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:37:26 UMAP embedding parameters a = 0.9922 b = 1.112
15:37:26 Read 10000 rows and found 50 numeric columns
15:37:26 Using Annoy for neighbor search, n_neighbors = 30
15:37:26 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:27 Writing NN index file to temp file /var/folders/x6/85llj18n3j18wmy0ylhx6pdm0000gn/T//RtmpQEBUwl/file142756bd697d
15:37:27 Searching Annoy index using 1 thread, search_k = 3000
15:37:28 Annoy recall = 100%
15:37:29 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:37:29 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
15:37:29 Using 'irlba' for PCA
15:37:30 PCA: 2 components explained 32.73% variance
15:37:30 Scaling init to sdev = 1
15:37:30 Commencing optimization for 500 epochs, with 438422 positive edges
15:37:30 Using rng type: pcg
15:37:37 Optimization finished
DimPlot(obj, reduction = "umap")

Inspecting the clustering

Large experiments usually are done in batches and that can be a cofounder that we don’t want to include in our analysis. If there is a strong batch effect, cells from the same cell type will cluster together based on their batch and not on their true identity. A quick way to check if you have a strong batch effect in your data is to plot the UMAP above but colored (or split) by batch. (You can also do this with any other variable that you have in your metadata)

DimPlot(obj, reduction = "umap", group.by = "batch")

# If you want to try the split version, un-comment (delete the '#") and run the line below
# DimPlot(obj, reduction = "umap", split.by = "batch") + NoLegend()

Identifying the clusters

Since we don’t see major batch effects, we can continue our analysis. Now, we want to identify the clusters. What cell types are they? To answer this question, we need to find the genes that define each cluster, or marker genes. These are genes that are differentially expressed in that cluster specifically when compared to all other cells. This is done using the code below. The function FindAllMarkers() performs a differential expression testing for each clusters versus all other cells, therefore, it might take a few minutes to run, depending on your computer’s specs. (You can skip this step if it takes too long or if your machine can’t handle it and go to to the cell identification step).

marker_genes <- FindAllMarkers(obj, 
                               only.pos = TRUE, # return only the positive differentially expressed genes
                               min.pct = 0.1, # only test genes that are detected in at least 10% of the cells in either cluster
                               logfc.threshold = 0.25) # the logFC of a gene between the two groups has to be at least 0.25

This gives you a large table with all the results. To make it easier to look at the markers, let’s reshape this and display only the top 10 most significant marker genes for each cluster.

# Getting the top 10 markers per cluster based on p-val and FC
top.markers <- marker_genes %>% 
  group_by(cluster) %>%
  filter(p_val_adj < 0.05) %>%
  arrange(p_val_adj, by.group = TRUE) %>%
  top_n(n = 10, wt = avg_log2FC)

# Reshaping the table
top.markers.clean <- top.markers %>%
  dplyr::select(cluster, gene) %>%
  group_by(cluster) %>%
  summarise(markers = str_c(unlist(pick(gene)), collapse=', '))
  
knitr::kable(top.markers.clean)
cluster markers
0 LINC00926, BANK1, FCER2, LINC02397, IGHD, GAPT, HVCN1, FAM129C, CD1C, TNFRSF13B
1 CCL5, KLRD1, TRGC2, CD160, GZMA, XCL1, XCL2, TRGC1, KLRC2, GZMK
2 APOB, APOA4, APOC3, APOA1, SLC5A12, MEP1B, SLC13A1, SLC2A2, ENPP7, FADS6
3 TIGIT, CD40LG, LEF1, ICOS, CTLA4, CHRM3-AS2, TOX2, CD28, TBC1D4, TNFRSF4
4 LUM, ADAMDEC1, CCL11, CXCL6, HAPLN1, FNDC1, CCL8, CCL13, SFTA1P, OGN
5 RRM2, CDK1, CLSPN, MND1, GTSE1, CENPA, CDCA5, SPC25, DLGAP5, FDCSP
6 OLFM4, GPX2, PLA2G2A, LGR5, GP2, REG3A, ITLN2, PRSS2, DEFA6, DEFA5
7 IGHA2, JCHAIN, LINC02362, AL391056.1, ENAM, AMPD1, IGLL5, IGHA1, LINC00582, FAM92B
8 FPR3, LILRB2, CD209, FCN1, CD163, LILRB5, SIGLEC1, S100A12, CD300E, TLR8
9 RGS13, SERPINA9, SUGCT, BORCS8-MEF2B, HTR3A, WDR66, CAMP, HRK, LINC00877, MIR3681HG
10 MUC2, CLCA1, FCGBP, ZG16, REP15, TFF1, ATOH1, CAPN9, B3GNT6, SYTL5
11 PLVAP, MMRN2, SOX18, FAM110D, PALMD, SOX17, SELE, LCN6, FCN3, CADM3-AS1
12 HLA-DQA2, MTRNR2L1, LINC00926, BANK1, IGHD, LINC02397, GBP4, FCRL1, C12orf42, HAPLN3
13 IGHG2, IGHGP, IGHG3, IGHG4, IGHG1, IGLV3-1, IGLV6-57, JSRP1, IGKC, IGLC3
14 ALKAL2, ENHO, POSTN, NPY, VEGFD, GLP2R, LY6H, VSTM2A, COL4A6, LINC01915
15 NOTCH3, COX4I2, HIGD1B, FAM162B, FOXS1, HEYL, FHL5, IL17B, AC018647.1, AVPR1A
16 ACTG2, HHIP, SOSTDC1, HSD17B6, DES, MYOCD, SLC30A8, LUZP2, CCDC144NL-AS1, ALKAL1
17 NRXN1, CDH19, GFRA3, PCSK2, LRRTM1, SOX2, FOXD3, PTPRZ1, LINC01505, LINC00461
18 LYVE1, MMRN1, C6orf141, RELN, PLIN5, AC007998.3, LINC02308, STAB2, AC011498.4, GPR1
19 TPSAB1, CPA3, TPSB2, MS4A2, KRT1, HDC, ADCYAP1, TPSD1, AL662860.1, LINC00323
20 CRYBA2, SCGN, SCG2, MIR7-3HG, MARCH4, CDH22, AC005256.1, LCN15, PYY, NTS

With the table above and a reference gene markers list, or a very good knowledge of biology, or a lot of Googling, you can identify each cluster. Luckly for us, this dataset already has the cell type annotated in the metadata, so we will use that to help us. Let’s first see how the clusters are distributed in the original cell type annotation.

ggplot(obj@meta.data, aes(x = seurat_clusters, y = cell_type)) +
  geom_count() +
  theme_classic()

As mentioned above, we have way less clusters and they are not as good defined as in the original publication, because we are working with just a small part of it. So, we are adding the following simplified annotation based on the overlap seen in the plot above:

Seurat cluster Cell type
0, 9, 12 B cells
1 Activated T cells
2 Enterocytes
3 T cells
4, 14, 15, 16, 18 Stromal cells
5, 7, 13 Plasma cells
8 Macrophages
6 Stem and TA cells
10 Goblet cells
19 Mast cells
11 Endothelial cells
17 Glia
20 EECs

We will add this new cell type annotation as a new column in the metadata called cell_type_new. Then, we can visualize the UMAP with this new annotation.

obj@meta.data <- obj@meta.data %>%
  mutate(cell_type_new = case_when(
    seurat_clusters %in% c(0,9,12) == TRUE ~ "B cells",
    seurat_clusters %in% c(1) == TRUE ~ "Activated T cells",
    seurat_clusters %in% c(2) == TRUE ~ "Enterocytes",
    seurat_clusters %in% c(3) == TRUE ~ "T cells",
    seurat_clusters %in% c(4, 14, 15, 16, 18) == TRUE ~ "Stromal cells",
    seurat_clusters %in% c(5, 7, 13) == TRUE ~ "Plasma cells",
    seurat_clusters %in% c(8) == TRUE ~ "Macrophages",
    seurat_clusters %in% c(6) == TRUE ~ "Stem and TA cells",
    seurat_clusters %in% c(10) == TRUE ~ "Goblet cells",
    seurat_clusters %in% c(19) == TRUE ~ "Mast cells",
    seurat_clusters %in% c(11) == TRUE ~ "Endothelial cells",
    seurat_clusters %in% c(17) == TRUE ~ "Glia",
    seurat_clusters %in% c(20) == TRUE ~ "EECs"
  ))

DimPlot(obj, group.by = "cell_type_new", label = T) + NoLegend() # this puts the labels in the plot instead of a legend on the side

Now that we have identified our cell types, we can perform some analysis.

Note

When you do your own analysis, you will have to actually identify the cell types yourself. To get help with that, check the “Cell type indentification” tutorial.

Downstream analyses

Differential expression analysis

In this dataset we have samples from individuals with Crohn’s Disease and healthy individuals, so we can investigate if there is a difference in gene expression of the cell types we identified because of an individual being sick. Let’s first just check if cells are present in both conditions by plotting the UMAP split by condition.

p <- DimPlot(obj, group.by = "cell_type_new", split.by = "Diagnosis")
print(p)

Now we can start the differential testing. We are using a method called MAST, that has been shown to perform well in single-cell data and allows us to correct for covariates such as batch, sex, age, and the cellular detection rate (the fraction of genes that are detected/expressed in each cell).

# Choose the cell type for that you want to test for
cluster <- "Enterocytes"
obj_test <- subset(obj, subset = cell_type_new == cluster)

# Set the dataset to test control vs disease
Idents(obj_test) <- "Diagnosis"

# Calculating CDR (cellular detection rate) to add in the model and correct for it
CDR <- scale(colSums(GetAssayData(obj_test, assay = "RNA", layer = "data") > 0))
obj_test$CDR <- CDR

# Finding DEGs - note how CDR is included in the model via the latent.vars parameter
DEG <- FindMarkers(obj_test, ident.1 = 'Pediatric Crohn Disease', ident.2 = 'Pediatric healthy', test.use = 'MAST', logfc.threshold = 0,
                       min.pct = 0.25, latent.vars = 'CDR')

Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...

Done!
# Create a dummy variable for gene names - it will be useful for plotting
DEG$gene <- rownames(DEG)

# Add the classification if a gene is up or down regulated based on the defined cutoffs for log2FC and adjusted p-value
DEG$DE <- NA

threshold <- 0.5 # cutff for logFC

DEG <- DEG %>%
  mutate(DE = case_when(
    p_val_adj < 0.05 & avg_log2FC > threshold ~ 'upregulated',
    p_val_adj < 0.05 & avg_log2FC < -threshold ~ 'downregulated',
    p_val_adj > 0.05 | abs(avg_log2FC) < threshold ~ 'not DE'),
    DE_gene = if_else(DE=='not significant', NA, gene)
    )

# Order and print the most upregulated genes
DEG <- DEG %>%
  arrange(desc(avg_log2FC)) # to see the most downregulated genes, just remove the desc()

knitr::kable(head(DEG))
p_val avg_log2FC pct.1 pct.2 p_val_adj gene DE DE_gene
FOLH1 0 2.534950 0.342 0.090 0e+00 FOLH1 upregulated FOLH1
HRCT1 0 1.569167 0.263 0.165 8e-07 HRCT1 upregulated HRCT1
RARRES1 0 1.451417 0.286 0.209 1e-06 RARRES1 upregulated RARRES1
HLA-DPB1 0 1.422872 0.603 0.327 0e+00 HLA-DPB1 upregulated HLA-DPB1
MTRNR2L8 0 1.328422 0.765 0.590 0e+00 MTRNR2L8 upregulated MTRNR2L8
SLC52A1 0 1.284953 0.367 0.232 2e-07 SLC52A1 upregulated SLC52A1

We can now visualize the DEG results as a volcano plot and highlight some of the genes using the code below.

color_map <- c("not DE" = "gray", "downregulated" = "#2166ac", "upregulated" = "#b2182b")

ggplot(DEG, aes(x = avg_log2FC, y = -log10(p_val_adj), color = DE, label = gene)) +
      geom_jitter() +
      geom_text_repel(max.overlaps = 5, force = 30, show.legend = FALSE) + 
      geom_hline(yintercept = -log10(0.05), colour = '#696969', linetype = 'dashed') +
      geom_vline(xintercept = -threshold, colour = '#696969', linetype = 'dashed') +
      geom_vline(xintercept = threshold, colour = '#696969', linetype = 'dashed') +
      scale_color_manual('', values = color_map) +
      labs(title = "DEGs in Enterocytes in Chron's Disease") +
      default_theme()
Warning: ggrepel: 2296 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Pathway enrichment analysis

The next question we can ask is if the up or downregulated genes are involved in common functions, like we did in the bulk RNA-seq analysis. To keep it simple, we will do just a GSEA here, but during your analysis, feel free to try the other methods, the idea is the same.

# Create the rannked gene list required for the function, this will be the DEGs ordered by log2FC
geneList_df <- DEG %>%
  filter(DE != "not significant") %>%
  dplyr::select(gene, avg_log2FC, p_val_adj) %>%
  arrange(desc(avg_log2FC))

geneList <- geneList_df$avg_log2FC
names(geneList) <- geneList_df$gene
  
gsea <- gseGO(
  geneList = geneList,
  OrgDb = org.Hs.eg.db,
  ont = "BP", 
  keyType = "SYMBOL",
  minGSSize = 10,
  pvalueCutoff = 0.05
)

This returns an object with the results that we can inspect with head(gsea@result). Although this table is good to inspect the results in detail, it is not a great way to visualize the results. For that, we can plot the most significant enriched and suppressed pathways. (In this case we only have significant suppressed pathways, so the plot will only have those, but you get the idea).

# This gets the top 5 enriched and supressed pathways based on the Normalized Enrichment Score (NES)
tmp <- arrange(gsea, desc(abs(NES))) %>%
  group_by(sign(NES)) %>%
  slice(1:5)

# Now we plot them
ggplot(tmp, showCategory = 10,
       aes(NES, fct_reorder(Description,NES), fill = -log10(p.adjust))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = '#696969', linetype = 'dashed') +
  scale_fill_gradientn(colours=c("#b3eebe", "#46bac2", "#371ea3"),
                        guide=guide_colorbar(reverse=TRUE)) +
  labs(x = 'Normalized Enrichment Score',
      y = '') +
  default_theme() +
  theme(text = element_text(size = 11))

Conclusion

This covers the basics of pre-processing a single-cell RNA-seq dataset and doing some downstream analysis to answer biological questions. Feel free to play with this code and try some different clustering parameter, use different cell types for the differential testing, etc. When you feel ready, run your own analysis with the assignment dataset.